Skip to content

Phase 1 changes for MCS Firstboot Pivot Failure Reporting#6029

Draft
CourtCourt521 wants to merge 2 commits into
openshift:mainfrom
CourtCourt521:FirstBootPivotErrorReporting
Draft

Phase 1 changes for MCS Firstboot Pivot Failure Reporting#6029
CourtCourt521 wants to merge 2 commits into
openshift:mainfrom
CourtCourt521:FirstBootPivotErrorReporting

Conversation

@CourtCourt521
Copy link
Copy Markdown
Contributor

@CourtCourt521 CourtCourt521 commented May 12, 2026

- What I did

Testing not an actual pull request

- How to verify it

Testing not an actual pull request

- Description for the changelog

Testing not an actual pull request

Testing not an actual pull request

Summary by CodeRabbit

  • New Features

    • Added --mcs-url flag to configure the Machine Config Server base URL.
    • Added a node-failure reporting endpoint that accepts firstboot failure reports (best-effort reporting).
  • Chores

    • Updated server startup to propagate MCS URL and include it in node annotations.
    • Updated deployment manifests to pass the MCS URL to the server.
  • Tests

    • Added tests for the node-failure handler and failure reporter behavior.

@openshift-merge-bot
Copy link
Copy Markdown
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 12, 2026
@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 12, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci
Copy link
Copy Markdown
Contributor

openshift-ci Bot commented May 12, 2026

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: CourtCourt521
Once this PR has been reviewed and has the lgtm label, please assign umohnani8 for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 12, 2026

Walkthrough

Adds an MCS URL option and propagates it from CLI/operator into bootstrap/cluster servers so node annotations include the URL; introduces FailureReporter interface, cluster/bootstrap implementations, a NodeFailureHandler HTTP endpoint, and tests; wires reporters into API server construction.

Changes

MCS URL + Node Failure Reporting

Layer / File(s) Summary
CLI flags and API wiring
cmd/machine-config-server/bootstrap.go, cmd/machine-config-server/start.go, pkg/server/api.go, pkg/server/api_test.go
Adds --mcs-url persistent flags, passes mcsURL into server constructors, creates NodeFailureHandler and registers /v1/node-failure on the API server; updates tests and mock server signatures.
FailureReporter implementations and tests
pkg/server/failure_reporter.go, pkg/server/failure_reporter_test.go
Adds FirstbootFailureReport and FailureReporter interface; implements cluster-mode Event-based reporter and bootstrap-mode logger reporter; adds unit tests exercising NodeFailureHandler behaviors including validation, accepted response, and reporter-error best-effort handling.
Server interface and node-annotation plumbing
pkg/server/server.go, pkg/server/cluster_server.go, pkg/server/bootstrap_server.go
Extends Server with GetKubeClient(), updates NewClusterServer/NewBootstrapServer to accept and store mcsURL, threads mcsURL into WithNodeAnnotations/getNodeAnnotation, and conditionally adds the MachineConfigServerURL annotation when mcsURL is non-empty.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 11 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 46.43% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (11 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The PR title accurately describes the main objective: it introduces phase 1 infrastructure for MCS (Machine Config Server) firstboot pivot failure reporting, matching the core changes across all modified files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed This PR uses standard Go testing.T, not Ginkgo. No Ginkgo test patterns (It(), Describe(), etc.) were found. All test names are static and deterministic with no dynamic values.
Test Structure And Quality ✅ Passed The custom check requires Ginkgo test assessment, but this PR's tests use standard Go testing with testify. Not applicable.
Microshift Test Compatibility ✅ Passed No Ginkgo e2e tests were added in this PR. The check targets Ginkgo e2e tests (g.It, g.Describe, etc.), but only standard Go unit tests (testing package) were modified. Check is not applicable.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No Ginkgo e2e tests are added in this PR. All test modifications are standard Go unit tests using testing.T framework. The custom check does not apply.
Topology-Aware Scheduling Compatibility ✅ Passed PR does not introduce new scheduling constraints. Only adds MCS URL parameter and failure reporting. Pre-existing nodeSelector/tolerations unchanged. No topology-specific assumptions.
Ote Binary Stdout Contract ✅ Passed PR does not modify the OTE binary or its dependencies. Changes only affect cmd/machine-config-server and pkg/server. OTE Stdout Contract check is not applicable.
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR contains only standard Go unit tests, not Ginkgo e2e tests. Tests use httptest mocking and don't make external calls or assume IPv4. Check is inapplicable.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Tip

💬 Introducing Slack Agent: The best way for teams to turn conversations into code.

Slack Agent is built on CodeRabbit's deep understanding of your code, so your team can collaborate across the entire SDLC without losing context.

  • Generate code and open pull requests
  • Plan features and break down work
  • Investigate incidents and troubleshoot customer tickets together
  • Automate recurring tasks and respond to alerts with triggers
  • Summarize progress and report instantly

Built for teams:

  • Shared memory across your entire org—no repeating context
  • Per-thread sandboxes to safely plan and execute work
  • Governance built-in—scoped access, auditability, and budget controls

One agent for your entire SDLC. Right inside Slack.

👉 Get started


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (3)
cmd/machine-config-server/bootstrap.go (1)

34-34: ⚡ Quick win

Consider validating the --mcs-url flag format.

Similar to the start command, the bootstrap command accepts any string value for --mcs-url without validating it's a well-formed URL. Adding basic URL validation when the flag is non-empty would improve robustness and prevent malformed URLs from being written to node annotations.

✨ Suggested validation approach
 func runBootstrapCmd(_ *cobra.Command, _ []string) {
 	flag.Set("logtostderr", "true")
 	flag.Parse()
 
+	if bootstrapOpts.mcsURL != "" {
+		if _, err := url.Parse(bootstrapOpts.mcsURL); err != nil {
+			klog.Exitf("--mcs-url is not a valid URL: %v", err)
+		}
+	}
+
 	// To help debugging, immediately log version
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/machine-config-server/bootstrap.go` at line 34, The bootstrap command
currently accepts any string for bootstrapOpts.mcsURL via
bootstrapCmd.PersistentFlags().StringVar; add validation so if
bootstrapOpts.mcsURL is non-empty you parse it (e.g., url.Parse) and ensure it
has a valid scheme (http or https) and a non-empty host, returning an error from
the command (RunE or a PreRun validation) when invalid; update the bootstrap
command to perform this check before proceeding to avoid storing malformed URLs
in node annotations.
pkg/server/server_test.go (1)

188-195: ⚡ Quick win

Consider adding explicit assertions for the mcsURL annotation.

The tests now pass a non-empty mcsURL to getNodeAnnotation, which should add the MachineConfigServerURLAnnotationKey to the generated annotations. While the tests validate the overall ignition structure (which includes the annotations file), they don't explicitly assert that the new annotation is present with the expected value.

Adding explicit assertions would improve test clarity and make it easier to catch regressions in this specific feature.

💡 Example assertion approach

After decoding the annotations JSON, add:

var annotations map[string]string
err = json.Unmarshal([]byte(anno), &annotations)
require.NoError(t, err)

assert.Equal(t, "https://api-int.test.example.com:22623", 
    annotations[daemonconsts.MachineConfigServerURLAnnotationKey],
    "MCS URL annotation should be present")

Also applies to: 383-390

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/server/server_test.go` around lines 188 - 195, Add explicit assertions
that the annotations returned by getNodeAnnotation include the
MachineConfigServerURLAnnotationKey with the expected mcsURL value; after
obtaining anno (used before appendFileToIgnition and referenced as mcIgnCfg),
unmarshal anno into a map[string]string and assert that
annotations[daemonconsts.MachineConfigServerURLAnnotationKey] equals
"https://api-int.test.example.com:22623". Apply the same check for the other
occurrence around lines 383-390 to ensure both test cases validate the presence
and value of the MCS URL annotation.
cmd/machine-config-server/start.go (1)

34-34: ⚡ Quick win

Consider validating the --mcs-url flag format.

The flag accepts any string value without validating it's a well-formed URL. While the operator typically generates this URL from infrastructure status, manual CLI invocation could provide a malformed URL that would be written to node annotations and cause status reporting failures.

Consider adding basic URL validation when the flag is non-empty to improve robustness.

✨ Suggested validation approach
 func runStartCmd(_ *cobra.Command, _ []string) {
 	flag.Set("logtostderr", "true")
 	flag.Parse()
 
+	if startOpts.mcsURL != "" {
+		if _, err := url.Parse(startOpts.mcsURL); err != nil {
+			klog.Exitf("--mcs-url is not a valid URL: %v", err)
+		}
+	}
+
 	// Initialize controller-runtime logger before any controller-runtime components are used
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/machine-config-server/start.go` at line 34, Add basic URL validation for
the --mcs-url flag value (startOpts.mcsURL) after it's parsed and before it's
used in status reporting: if startOpts.mcsURL is non-empty, parse it (e.g.,
using net/url's Parse or ParseRequestURI) and ensure it has a valid scheme
(http/https) and non-empty host; if validation fails return an error (or exit)
with a clear message so the process doesn't write a malformed URL to node
annotations. Place this check in the command initialization/run logic that
executes after startCmd.PersistentFlags().StringVar is parsed so it rejects
invalid values early.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pkg/operator/bootstrap.go`:
- Around line 150-157: The call to getIgnitionHost can panic because it indexes
APIServerInternalIPs[0] without verifying the slice length; update the
getIgnitionHost function to check that platform-specific APIServerInternalIPs
(or equivalent IP lists) are non-empty before indexing and return a clear error
if empty, and keep the existing error return path in the bootstrap caller (the
call site in bootstrap.go already checks err). Ensure the check covers all
platforms handled inside getIgnitionHost and that the returned error message
identifies the missing internal IP so the bootstrap render path can fail
gracefully instead of panicking.

---

Nitpick comments:
In `@cmd/machine-config-server/bootstrap.go`:
- Line 34: The bootstrap command currently accepts any string for
bootstrapOpts.mcsURL via bootstrapCmd.PersistentFlags().StringVar; add
validation so if bootstrapOpts.mcsURL is non-empty you parse it (e.g.,
url.Parse) and ensure it has a valid scheme (http or https) and a non-empty
host, returning an error from the command (RunE or a PreRun validation) when
invalid; update the bootstrap command to perform this check before proceeding to
avoid storing malformed URLs in node annotations.

In `@cmd/machine-config-server/start.go`:
- Line 34: Add basic URL validation for the --mcs-url flag value
(startOpts.mcsURL) after it's parsed and before it's used in status reporting:
if startOpts.mcsURL is non-empty, parse it (e.g., using net/url's Parse or
ParseRequestURI) and ensure it has a valid scheme (http/https) and non-empty
host; if validation fails return an error (or exit) with a clear message so the
process doesn't write a malformed URL to node annotations. Place this check in
the command initialization/run logic that executes after
startCmd.PersistentFlags().StringVar is parsed so it rejects invalid values
early.

In `@pkg/server/server_test.go`:
- Around line 188-195: Add explicit assertions that the annotations returned by
getNodeAnnotation include the MachineConfigServerURLAnnotationKey with the
expected mcsURL value; after obtaining anno (used before appendFileToIgnition
and referenced as mcIgnCfg), unmarshal anno into a map[string]string and assert
that annotations[daemonconsts.MachineConfigServerURLAnnotationKey] equals
"https://api-int.test.example.com:22623". Apply the same check for the other
occurrence around lines 383-390 to ensure both test cases validate the presence
and value of the MCS URL annotation.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: f1f35220-fad5-4aec-94b8-29e7a568fbbc

📥 Commits

Reviewing files that changed from the base of the PR and between 99cb8a4 and 49e978d.

📒 Files selected for processing (12)
  • cmd/machine-config-server/bootstrap.go
  • cmd/machine-config-server/start.go
  • manifests/bootstrap-pod-v2.yaml
  • manifests/machineconfigserver/daemonset.yaml
  • pkg/daemon/constants/constants.go
  • pkg/operator/bootstrap.go
  • pkg/operator/render.go
  • pkg/operator/sync.go
  • pkg/server/bootstrap_server.go
  • pkg/server/cluster_server.go
  • pkg/server/server.go
  • pkg/server/server_test.go

Comment thread pkg/operator/bootstrap.go
Comment on lines +150 to +157
ignitionHost, err := getIgnitionHost(&dependencies.Infrastructure.Status)
if err != nil {
return nil, err
}
mcsURL := fmt.Sprintf("https://%s", ignitionHost)

config := getRenderConfig("", dependencies.KubeAPIServerServingCA, spec,
&imgs.RenderConfigImages, dependencies.Infrastructure, nil, nil, "2")
&imgs.RenderConfigImages, dependencies.Infrastructure, nil, nil, mcsURL, "2")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard getIgnitionHost against empty platform IP lists before bootstrap render.

Line 150 introduces a new bootstrap-time call path to getIgnitionHost, but that helper indexes APIServerInternalIPs[0] for some platforms without a length check. Empty lists will panic the render path.

Proposed hardening
--- a/pkg/operator/sync.go
+++ b/pkg/operator/sync.go
@@
 	if infraStatus.PlatformStatus != nil {
 		switch infraStatus.PlatformStatus.Type {
 		case configv1.BareMetalPlatformType:
-			ignitionHost = net.JoinHostPort(infraStatus.PlatformStatus.BareMetal.APIServerInternalIPs[0], securePortStr)
+			if infraStatus.PlatformStatus.BareMetal != nil && len(infraStatus.PlatformStatus.BareMetal.APIServerInternalIPs) > 0 {
+				ignitionHost = net.JoinHostPort(infraStatus.PlatformStatus.BareMetal.APIServerInternalIPs[0], securePortStr)
+			}
 		case configv1.OpenStackPlatformType:
-			ignitionHost = net.JoinHostPort(infraStatus.PlatformStatus.OpenStack.APIServerInternalIPs[0], securePortStr)
+			if infraStatus.PlatformStatus.OpenStack != nil && len(infraStatus.PlatformStatus.OpenStack.APIServerInternalIPs) > 0 {
+				ignitionHost = net.JoinHostPort(infraStatus.PlatformStatus.OpenStack.APIServerInternalIPs[0], securePortStr)
+			}
 		case configv1.OvirtPlatformType:
-			ignitionHost = net.JoinHostPort(infraStatus.PlatformStatus.Ovirt.APIServerInternalIPs[0], securePortStr)
+			if infraStatus.PlatformStatus.Ovirt != nil && len(infraStatus.PlatformStatus.Ovirt.APIServerInternalIPs) > 0 {
+				ignitionHost = net.JoinHostPort(infraStatus.PlatformStatus.Ovirt.APIServerInternalIPs[0], securePortStr)
+			}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/operator/bootstrap.go` around lines 150 - 157, The call to
getIgnitionHost can panic because it indexes APIServerInternalIPs[0] without
verifying the slice length; update the getIgnitionHost function to check that
platform-specific APIServerInternalIPs (or equivalent IP lists) are non-empty
before indexing and return a clear error if empty, and keep the existing error
return path in the bootstrap caller (the call site in bootstrap.go already
checks err). Ensure the check covers all platforms handled inside
getIgnitionHost and that the returned error message identifies the missing
internal IP so the bootstrap render path can fail gracefully instead of
panicking.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cmd/machine-config-server/start.go`:
- Around line 52-55: The code calls server.NewClusterServer and logs errors with
ctrlcommon.WriteTerminationError but continues execution, risking a nil
dereference when calling cs.GetKubeClient(); update the error handling in the
NewClusterServer call (and the similar block around cs.GetKubeClient() at the
second occurrence) to fail fast after ctrlcommon.WriteTerminationError by
returning or exiting (e.g., return an error from start or call os.Exit(1)) so
that cs is never dereferenced when NewClusterServer returns an error.

In `@pkg/server/api.go`:
- Around line 48-53: NewAPIServer currently calls mux.Handle("/v1/node-failure",
failureHandler) which will panic if failureHandler is nil; update NewAPIServer
to guard this by checking if failureHandler != nil before registering, and if it
is nil register a safe fallback (e.g., http.NotFoundHandler() or a small handler
that returns 404/503) or skip registration entirely so ServeMux.Handle is never
called with a nil value; adjust the code around NewAPIServer,
NodeFailureHandler, and the mux.Handle call accordingly.

In `@pkg/server/failure_reporter.go`:
- Around line 51-70: The ReportFailure code is treating any error from
r.kubeclient.CoreV1().Events(...).Get as "event missing" and attempts Create,
which hides real API/RBAC/network errors; update the logic in ReportFailure to
check the Get error with metav1.IsNotFound(err) (or errors.IsNotFound) and only
proceed to create the Event when IsNotFound returns true, otherwise return or
propagate the original Get error (and log it) instead of attempting Create;
refer to the existing Get call and the Create/Update branches around
existingEvent, metav1.IsNotFound, and the
r.kubeclient.CoreV1().Events(...).Create/Update calls to locate where to change
control flow.
- Around line 112-126: The sanitizeForEventName function currently only replaces
":", "_", and "." and doesn't enforce Kubernetes event name rules; update
sanitizeForEventName to (1) map any character not in [a-z0-9-] to '-' (use
lowercase first), (2) collapse consecutive '-' into a single '-', (3) trim
leading/trailing '-' so the name starts/ends with an alphanumeric, and (4)
enforce the Kubernetes DNS label length limit by truncating to 63 chars
(ensuring trimming again if truncation creates a trailing '-'); this will
guarantee names match the [a-z0-9]([-a-z0-9]*[a-z0-9])? pattern.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository: openshift/coderabbit/.coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 9871f01b-840a-47c3-8257-a57c71726745

📥 Commits

Reviewing files that changed from the base of the PR and between 49e978d and 57fde61.

📒 Files selected for processing (9)
  • cmd/machine-config-server/bootstrap.go
  • cmd/machine-config-server/start.go
  • pkg/server/api.go
  • pkg/server/api_test.go
  • pkg/server/bootstrap_server.go
  • pkg/server/cluster_server.go
  • pkg/server/failure_reporter.go
  • pkg/server/failure_reporter_test.go
  • pkg/server/server.go
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/server/bootstrap_server.go

Comment on lines +52 to 55
cs, err := server.NewClusterServer(startOpts.kubeconfig, startOpts.apiserverURL, startOpts.mcsURL)
if err != nil {
ctrlcommon.WriteTerminationError(err)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Exit after NewClusterServer error to avoid nil dereference.

When NewClusterServer returns an error, execution continues and cs.GetKubeClient() can panic. Fail fast after writing the termination error.

Suggested fix
 	cs, err := server.NewClusterServer(startOpts.kubeconfig, startOpts.apiserverURL, startOpts.mcsURL)
 	if err != nil {
 		ctrlcommon.WriteTerminationError(err)
+		klog.Exitf("failed to initialize cluster server: %v", err)
 	}

Also applies to: 60-61

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cmd/machine-config-server/start.go` around lines 52 - 55, The code calls
server.NewClusterServer and logs errors with ctrlcommon.WriteTerminationError
but continues execution, risking a nil dereference when calling
cs.GetKubeClient(); update the error handling in the NewClusterServer call (and
the similar block around cs.GetKubeClient() at the second occurrence) to fail
fast after ctrlcommon.WriteTerminationError by returning or exiting (e.g.,
return an error from start or call os.Exit(1)) so that cs is never dereferenced
when NewClusterServer returns an error.

Comment thread pkg/server/api.go
Comment on lines +48 to 53
func NewAPIServer(a *APIHandler, failureHandler *NodeFailureHandler, p int, is bool, c, k string, t *tls.Config) *APIServer {
mux := http.NewServeMux()
mux.Handle("/config/", a)
mux.Handle("/healthz", &healthHandler{})
mux.Handle("/v1/node-failure", failureHandler) // NEW
mux.Handle("/", &defaultHandler{})
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Guard against nil failureHandler in NewAPIServer.

http.ServeMux.Handle panics with a nil handler. NewAPIServer should tolerate nil (or explicitly fail) instead of panicking during initialization.

Suggested fix
 func NewAPIServer(a *APIHandler, failureHandler *NodeFailureHandler, p int, is bool, c, k string, t *tls.Config) *APIServer {
 	mux := http.NewServeMux()
 	mux.Handle("/config/", a)
 	mux.Handle("/healthz", &healthHandler{})
-	mux.Handle("/v1/node-failure", failureHandler) // NEW
+	if failureHandler != nil {
+		mux.Handle("/v1/node-failure", failureHandler)
+	}
 	mux.Handle("/", &defaultHandler{})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func NewAPIServer(a *APIHandler, failureHandler *NodeFailureHandler, p int, is bool, c, k string, t *tls.Config) *APIServer {
mux := http.NewServeMux()
mux.Handle("/config/", a)
mux.Handle("/healthz", &healthHandler{})
mux.Handle("/v1/node-failure", failureHandler) // NEW
mux.Handle("/", &defaultHandler{})
func NewAPIServer(a *APIHandler, failureHandler *NodeFailureHandler, p int, is bool, c, k string, t *tls.Config) *APIServer {
mux := http.NewServeMux()
mux.Handle("/config/", a)
mux.Handle("/healthz", &healthHandler{})
if failureHandler != nil {
mux.Handle("/v1/node-failure", failureHandler)
}
mux.Handle("/", &defaultHandler{})
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/server/api.go` around lines 48 - 53, NewAPIServer currently calls
mux.Handle("/v1/node-failure", failureHandler) which will panic if
failureHandler is nil; update NewAPIServer to guard this by checking if
failureHandler != nil before registering, and if it is nil register a safe
fallback (e.g., http.NotFoundHandler() or a small handler that returns 404/503)
or skip registration entirely so ServeMux.Handle is never called with a nil
value; adjust the code around NewAPIServer, NodeFailureHandler, and the
mux.Handle call accordingly.

Comment on lines +51 to +70
// Check if event already exists
existingEvent, err := r.kubeclient.CoreV1().Events(namespace).Get(ctx, eventName, metav1.GetOptions{})

now := metav1.NewTime(time.Now())

if err == nil {
// Event exists, update it (increment count, update timestamp and message)
existingEvent.Count++
existingEvent.LastTimestamp = now
existingEvent.Message = formatFailureMessage(report)

_, updateErr := r.kubeclient.CoreV1().Events(namespace).Update(ctx, existingEvent, metav1.UpdateOptions{})
if updateErr != nil {
klog.Errorf("Failed to update firstboot failure event %s: %v", eventName, updateErr)
return updateErr
}
klog.Infof("Updated firstboot failure event for pool=%s node=%s (count=%d)",
report.Pool, report.NodeID, existingEvent.Count)
} else {
// Event doesn't exist, create it
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Handle non-NotFound errors from Event GET explicitly.

ReportFailure currently treats any Get error as “event missing” and tries Create. For RBAC/API/network failures, this masks the real error and can produce misleading follow-up errors.

Suggested fix
+import apierrors "k8s.io/apimachinery/pkg/api/errors"
...
-	existingEvent, err := r.kubeclient.CoreV1().Events(namespace).Get(ctx, eventName, metav1.GetOptions{})
+	existingEvent, err := r.kubeclient.CoreV1().Events(namespace).Get(ctx, eventName, metav1.GetOptions{})
...
-	if err == nil {
+	if err == nil {
 		// Event exists, update it (increment count, update timestamp and message)
 		...
-	} else {
+	} else if apierrors.IsNotFound(err) {
 		// Event doesn't exist, create it
 		...
+	} else {
+		klog.Errorf("Failed to get firstboot failure event %s: %v", eventName, err)
+		return err
 	}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/server/failure_reporter.go` around lines 51 - 70, The ReportFailure code
is treating any error from r.kubeclient.CoreV1().Events(...).Get as "event
missing" and attempts Create, which hides real API/RBAC/network errors; update
the logic in ReportFailure to check the Get error with metav1.IsNotFound(err)
(or errors.IsNotFound) and only proceed to create the Event when IsNotFound
returns true, otherwise return or propagate the original Get error (and log it)
instead of attempting Create; refer to the existing Get call and the
Create/Update branches around existingEvent, metav1.IsNotFound, and the
r.kubeclient.CoreV1().Events(...).Create/Update calls to locate where to change
control flow.

Comment on lines +112 to +126
func sanitizeForEventName(nodeID string) string {
// Replace invalid characters with '-'
// Kubernetes event names must match: [a-z0-9]([-a-z0-9]*[a-z0-9])?
sanitized := strings.ToLower(nodeID)
sanitized = strings.ReplaceAll(sanitized, ":", "-")
sanitized = strings.ReplaceAll(sanitized, "_", "-")
sanitized = strings.ReplaceAll(sanitized, ".", "-")

// Truncate if needed
if len(sanitized) > 200 {
sanitized = sanitized[:200]
}

return sanitized
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Sanitization can still produce invalid Kubernetes event names.

Current sanitization misses characters outside :_. and doesn’t enforce start/end alphanumeric constraints. Some node IDs can still fail Event creation.

Suggested fix
 func sanitizeForEventName(nodeID string) string {
-	sanitized := strings.ToLower(nodeID)
-	sanitized = strings.ReplaceAll(sanitized, ":", "-")
-	sanitized = strings.ReplaceAll(sanitized, "_", "-")
-	sanitized = strings.ReplaceAll(sanitized, ".", "-")
+	sanitized := strings.ToLower(nodeID)
+	b := strings.Builder{}
+	for _, r := range sanitized {
+		if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
+			b.WriteRune(r)
+		} else {
+			b.WriteRune('-')
+		}
+	}
+	sanitized = b.String()
+	sanitized = strings.Trim(sanitized, "-")
+	if sanitized == "" {
+		sanitized = "unknown-node"
+	}
 
 	if len(sanitized) > 200 {
 		sanitized = sanitized[:200]
+		sanitized = strings.TrimRight(sanitized, "-")
 	}
 
 	return sanitized
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
func sanitizeForEventName(nodeID string) string {
// Replace invalid characters with '-'
// Kubernetes event names must match: [a-z0-9]([-a-z0-9]*[a-z0-9])?
sanitized := strings.ToLower(nodeID)
sanitized = strings.ReplaceAll(sanitized, ":", "-")
sanitized = strings.ReplaceAll(sanitized, "_", "-")
sanitized = strings.ReplaceAll(sanitized, ".", "-")
// Truncate if needed
if len(sanitized) > 200 {
sanitized = sanitized[:200]
}
return sanitized
}
func sanitizeForEventName(nodeID string) string {
// Replace invalid characters with '-'
// Kubernetes event names must match: [a-z0-9]([-a-z0-9]*[a-z0-9])?
sanitized := strings.ToLower(nodeID)
b := strings.Builder{}
for _, r := range sanitized {
if (r >= 'a' && r <= 'z') || (r >= '0' && r <= '9') || r == '-' {
b.WriteRune(r)
} else {
b.WriteRune('-')
}
}
sanitized = b.String()
sanitized = strings.Trim(sanitized, "-")
if sanitized == "" {
sanitized = "unknown-node"
}
// Truncate if needed
if len(sanitized) > 200 {
sanitized = sanitized[:200]
sanitized = strings.TrimRight(sanitized, "-")
}
return sanitized
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pkg/server/failure_reporter.go` around lines 112 - 126, The
sanitizeForEventName function currently only replaces ":", "_", and "." and
doesn't enforce Kubernetes event name rules; update sanitizeForEventName to (1)
map any character not in [a-z0-9-] to '-' (use lowercase first), (2) collapse
consecutive '-' into a single '-', (3) trim leading/trailing '-' so the name
starts/ends with an alphanumeric, and (4) enforce the Kubernetes DNS label
length limit by truncating to 63 chars (ensuring trimming again if truncation
creates a trailing '-'); this will guarantee names match the
[a-z0-9]([-a-z0-9]*[a-z0-9])? pattern.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant